Kira Plastinina is a Russian fashion brand that was sold through a now-defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines, and Armenia. The brand’s Sales and Marketing team would like to understand their customers’ behavior from data they have collected over the past year. More specifically, they would like to learn the characteristics of customer groups.
Perform clustering, stating insights drawn from your analysis and visualizations.
Provide comparisons between the approaches learned this week, i.e. K-Means clustering vs. hierarchical clustering, highlighting the strengths and limitations of each approach in the context of your analysis.
Understand the characteristics of Kira Plastinina customers and classify them into groups using unsupervised machine learning algorithms.
Study of online shoppers, specifically for the Kira Plastinina brand.
shoppers <- read.csv("~/Documents/R markdowns/IP Week 13/online_shoppers_intention.csv")
head(shoppers)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 -1 0 -1
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 1 0.000000 0.20000000 0.2000000 0
## 2 2 64.000000 0.00000000 0.1000000 0
## 3 1 -1.000000 0.20000000 0.2000000 0
## 4 2 2.666667 0.05000000 0.1400000 0
## 5 10 627.500000 0.02000000 0.0500000 0
## 6 19 154.216667 0.01578947 0.0245614 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 0 Feb 1 1 1 1
## 2 0 Feb 2 2 1 2
## 3 0 Feb 4 1 9 3
## 4 0 Feb 3 2 2 4
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## VisitorType Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
# check the structure of Data
str(shoppers)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 1 2 3 ...
## $ ProductRelated_Duration: num 0 64 -1 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
dim(shoppers)
## [1] 12330 18
summary(shoppers)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : -1.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 1.000 Median : 8.00 Median : 0.000
## Mean : 2.318 Mean : 80.91 Mean : 0.504
## 3rd Qu.: 4.000 3rd Qu.: 93.50 3rd Qu.: 0.000
## Max. :27.000 Max. :3398.75 Max. :24.000
## NA's :14 NA's :14 NA's :14
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : -1.00 Min. : 0.00 Min. : -1.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 185.0
## Median : 0.00 Median : 18.00 Median : 599.8
## Mean : 34.51 Mean : 31.76 Mean : 1196.0
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1466.5
## Max. :2549.38 Max. :705.00 Max. :63973.5
## NA's :14 NA's :14 NA's :14
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003119 Median :0.02512 Median : 0.000 Median :0.00000
## Mean :0.022152 Mean :0.04300 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016684 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
## NA's :14 NA's :14
## Month OperatingSystems Browser Region
## Length:12330 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.357 Mean :3.147
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
##
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 Length:12330 Mode :logical Mode :logical
## 1st Qu.: 2.00 Class :character FALSE:9462 FALSE:10422
## Median : 2.00 Mode :character TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
##
# convert all negative values to NA
shoppers[shoppers<0] <- NA
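Note that `shoppers < 0` also compares the character and logical columns against zero. A safer variant (a sketch, not part of the original analysis) restricts the replacement to the numeric columns only:

```r
# Replace negative sentinel values with NA in the numeric columns only,
# leaving character/logical columns untouched
num_cols <- sapply(shoppers, is.numeric)
shoppers[num_cols] <- lapply(shoppers[num_cols], function(col) {
  col[col < 0] <- NA
  col
})
```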
# convert into a factor
shoppers$VisitorType <- factor(shoppers$VisitorType)
head(shoppers$VisitorType)
## [1] Returning_Visitor Returning_Visitor Returning_Visitor Returning_Visitor
## [5] Returning_Visitor Returning_Visitor
## Levels: New_Visitor Other Returning_Visitor
shoppers$Weekend <- factor(shoppers$Weekend)
head(shoppers$Weekend)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
## Levels: FALSE TRUE
shoppers$Revenue <- factor(shoppers$Revenue)
head(shoppers$Revenue)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## Levels: FALSE TRUE
shoppers$Month <- factor(shoppers$Month)
head(shoppers$Month)
## [1] Feb Feb Feb Feb Feb Feb
## Levels: Aug Dec Feb Jul June Mar May Nov Oct Sep
#check for missing values
colSums(is.na(shoppers))
## Administrative Administrative_Duration Informational
## 14 47 14
## Informational_Duration ProductRelated ProductRelated_Duration
## 47 14 47
## BounceRates ExitRates PageValues
## 14 14 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
anyNA(shoppers)
## [1] TRUE
#library("Amelia")
#missmap(shoppers)
# use MICE to predict missing values
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
mice_mod <- mice(shoppers[, c("Administrative","Administrative_Duration","Informational","Informational_Duration","ProductRelated","ProductRelated_Duration","BounceRates","ExitRates")], method='rf')
##
## iter imp variable
## 1 1 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 1 2 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 1 3 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 1 4 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 1 5 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 2 1 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 2 2 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 2 3 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 2 4 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 2 5 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 3 1 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 3 2 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 3 3 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 3 4 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 3 5 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 4 1 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 4 2 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 4 3 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 4 4 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 4 5 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 5 1 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 5 2 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 5 3 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 5 4 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## 5 5 Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates
## Warning: Number of logged events: 1
mice_complete <- complete(mice_mod)
shoppers$Administrative <- mice_complete$Administrative
shoppers$Administrative_Duration <- mice_complete$Administrative_Duration
shoppers$Informational <- mice_complete$Informational
shoppers$Informational_Duration <- mice_complete$Informational_Duration
shoppers$ProductRelated <- mice_complete$ProductRelated
shoppers$ProductRelated_Duration <- mice_complete$ProductRelated_Duration
shoppers$BounceRates <- mice_complete$BounceRates
shoppers$ExitRates <- mice_complete$ExitRates
# Check for any missing values
anyNA(shoppers)
## [1] FALSE
shoppers <- unique(shoppers)
dim(shoppers)
## [1] 12205 18
num_col <- shoppers[ ,c(1,2,3,4,5,6,7,8,9,10,12,13,14,15)]
# plot a boxplot for each column of x (values are read from the global
# shoppers data frame, so re-running after capping reflects the updates)
outlier_detection = function(x){
for(i in colnames(x)){
boxplot(shoppers[[i]], xlab=i, main=paste0("Boxplot for ",i))
}
}
outlier_detection(num_col)
# replace outliers with the 5th and 95th percentile
outlier_replace <- function(x){
qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
caps <- quantile(x, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(x, na.rm = T)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
return(x)
}
shoppers$Administrative <- outlier_replace(shoppers$Administrative)
shoppers$Administrative_Duration <-outlier_replace(shoppers$Administrative_Duration)
shoppers$Informational <- outlier_replace(shoppers$Informational)
shoppers$Informational_Duration <- outlier_replace(shoppers$Informational_Duration )
shoppers$ProductRelated <- outlier_replace(shoppers$ProductRelated)
shoppers$ProductRelated_Duration <- outlier_replace(shoppers$ProductRelated_Duration)
shoppers$BounceRates <- outlier_replace(shoppers$BounceRates)
shoppers$ExitRates <- outlier_replace(shoppers$ExitRates)
shoppers$PageValues <- outlier_replace(shoppers$PageValues)
shoppers$SpecialDay <- outlier_replace(shoppers$SpecialDay)
shoppers$OperatingSystems <- outlier_replace(shoppers$OperatingSystems)
shoppers$Browser <- outlier_replace(shoppers$Browser)
shoppers$Region <- outlier_replace(shoppers$Region)
shoppers$TrafficType <- outlier_replace(shoppers$TrafficType)
# check to see if there are more outliers
outlier_detection(num_col)
From our analysis we see that we have a total of 12330 records with 18 columns; 10 columns are numeric and 8 are categorical.
Columns holding discrete, categorical values, namely: a) Month b) Weekend c) Revenue d) VisitorType
were converted to factors, and all negative values in the numeric columns were replaced with NA. The dataset had missing values, which were imputed with the MICE library, which predicts missing values using random forests.
Duplicated records were dropped, reducing the dataset to 12205 records and 18 columns.
The numeric columns had outliers; capping was used, replacing them with the 5th and 95th percentiles. Some outliers remained, but they were not removed.
#### Numeric Variables
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(num_col, skew=TRUE)
## num_col
##
## 14 Variables 12205 Observations
## --------------------------------------------------------------------------------
## Administrative
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 27 0.898 2.34 3.193 0 0
## .25 .50 .75 .90 .95
## 0 1 4 7 9
##
## lowest : 0 1 2 3 4, highest: 22 23 24 26 27
## --------------------------------------------------------------------------------
## Administrative_Duration
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 3335 0.894 81.65 125.7 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 9.0 94.7 227.0 352.2
##
## lowest : 0.000000 1.333333 2.000000 3.000000 3.500000
## highest: 2407.423810 2629.253968 2657.318056 2720.500000 3398.750000
## --------------------------------------------------------------------------------
## Informational
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 17 0.516 0.5086 0.8759 0 0
## .25 .50 .75 .90 .95
## 0 0 0 2 3
##
## lowest : 0 1 2 3 4, highest: 12 13 14 16 24
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 9575 1041 727 380 222 99 78 36 14 15 7
## Proportion 0.785 0.085 0.060 0.031 0.018 0.008 0.006 0.003 0.001 0.001 0.001
##
## Value 11 12 13 14 16 24
## Frequency 1 5 1 2 1 1
## Proportion 0.000 0.000 0.000 0.000 0.000 0.000
## --------------------------------------------------------------------------------
## Informational_Duration
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 1258 0.482 34.82 64.67 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.0 0.0 73.9 199.0
##
## lowest : 0.000 1.000 1.500 2.000 2.500
## highest: 2166.500 2195.300 2252.033 2256.917 2549.375
## --------------------------------------------------------------------------------
## ProductRelated
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 311 0.999 32.07 36.5 2 3
## .25 .50 .75 .90 .95
## 8 18 38 74 110
##
## lowest : 0 1 2 3 4, highest: 518 534 584 686 705
## --------------------------------------------------------------------------------
## ProductRelated_Duration
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 9551 1 1208 1491 0.00 45.08
## .25 .50 .75 .90 .95
## 194.23 610.19 1477.39 2904.46 4312.96
##
## lowest : 0.000000 0.500000 1.000000 2.333333 2.666667
## highest: 24844.156200 27009.859430 29970.465970 43171.233380 63973.522230
## --------------------------------------------------------------------------------
## BounceRates
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 1872 0.908 0.02034 0.03271 0.000000 0.000000
## .25 .50 .75 .90 .95
## 0.000000 0.002899 0.016667 0.050000 0.145905
##
## lowest : 0.0000000 0.0000273 0.0000335 0.0000383 0.0000394
## highest: 0.1750000 0.1769231 0.1800000 0.1833333 0.2000000
## --------------------------------------------------------------------------------
## ExitRates
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 4777 1 0.04141 0.04219 0.004545 0.007407
## .25 .50 .75 .90 .95
## 0.014226 0.025000 0.048447 0.100000 0.170893
##
## lowest : 0.000000000 0.000175593 0.000250438 0.000262123 0.000263158
## highest: 0.183333333 0.186666667 0.188888889 0.192307692 0.200000000
## --------------------------------------------------------------------------------
## PageValues
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 2704 0.532 5.95 10.65 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 0.00 19.11 38.30
##
## lowest : 0.00000000 0.03803454 0.06704955 0.09354695 0.09862140
## highest: 261.49128570 270.78469310 287.95379280 360.95338390 361.76374190
## --------------------------------------------------------------------------------
## SpecialDay
## n missing distinct Info Mean Gmd
## 12205 0 6 0.277 0.06194 0.1141
##
## lowest : 0.0 0.2 0.4 0.6 0.8, highest: 0.2 0.4 0.6 0.8 1.0
##
## Value 0.0 0.2 0.4 0.6 0.8 1.0
## Frequency 10956 178 243 350 324 154
## Proportion 0.898 0.015 0.020 0.029 0.027 0.013
## --------------------------------------------------------------------------------
## OperatingSystems
## n missing distinct Info Mean Gmd
## 12205 0 8 0.828 2.124 0.8613
##
## lowest : 1 2 3 4 5, highest: 4 5 6 7 8
##
## Value 1 2 3 4 5 6 7 8
## Frequency 2549 6541 2530 478 6 19 7 75
## Proportion 0.209 0.536 0.207 0.039 0.000 0.002 0.001 0.006
## --------------------------------------------------------------------------------
## Browser
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 13 0.722 2.358 1.315 1 1
## .25 .50 .75 .90 .95
## 2 2 2 4 5
##
## lowest : 1 2 3 4 5, highest: 9 10 11 12 13
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 2427 7883 105 731 465 174 49 135 1 163 6
## Proportion 0.199 0.646 0.009 0.060 0.038 0.014 0.004 0.011 0.000 0.013 0.000
##
## Value 12 13
## Frequency 10 56
## Proportion 0.001 0.005
## --------------------------------------------------------------------------------
## Region
## n missing distinct Info Mean Gmd
## 12205 0 9 0.933 3.153 2.553
##
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##
## Value 1 2 3 4 5 6 7 8 9
## Frequency 4714 1128 2379 1171 318 801 758 431 505
## Proportion 0.386 0.092 0.195 0.096 0.026 0.066 0.062 0.035 0.041
## --------------------------------------------------------------------------------
## TrafficType
## n missing distinct Info Mean Gmd .05 .10
## 12205 0 20 0.954 4.074 3.735 1 1
## .25 .50 .75 .90 .95
## 2 2 4 11 13
##
## lowest : 1 2 3 4 5, highest: 16 17 18 19 20
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 2388 3911 2013 1066 260 443 40 343 41 450 247
## Proportion 0.196 0.320 0.165 0.087 0.021 0.036 0.003 0.028 0.003 0.037 0.020
##
## Value 12 13 14 15 16 17 18 19 20
## Frequency 1 728 13 37 3 1 10 17 193
## Proportion 0.000 0.060 0.001 0.003 0.000 0.000 0.001 0.001 0.016
## --------------------------------------------------------------------------------
histogram = function(x){
for(i in colnames(x)){
hist(shoppers[[i]], breaks = 10,main =i,xlab = i,col = "dodgerblue")
}
}
histogram(num_col)
### Bivariate
#### Correlation
# Convert the Revenue Col to Numeric
shoppers$Revenue <- as.numeric(shoppers$Revenue)
nums <- shoppers[ ,c(1,2,3,4,5,6,7,8,9,10,12,13,14,15,18)]
# Correlation matrix
corr_matrix = cor(nums)
corr <- as.data.frame(round(corr_matrix,2))
corr
## Administrative Administrative_Duration Informational
## Administrative 1.00 0.77 0.37
## Administrative_Duration 0.77 1.00 0.32
## Informational 0.37 0.32 1.00
## Informational_Duration 0.37 0.32 0.94
## ProductRelated 0.44 0.33 0.38
## ProductRelated_Duration 0.39 0.34 0.37
## BounceRates -0.25 -0.20 -0.14
## ExitRates -0.35 -0.29 -0.20
## PageValues 0.35 0.30 0.23
## SpecialDay -0.11 -0.10 -0.05
## OperatingSystems 0.00 -0.01 0.00
## Browser -0.03 -0.04 -0.03
## Region 0.00 0.01 -0.02
## TrafficType -0.03 -0.02 -0.03
## Revenue 0.14 0.13 0.11
## Informational_Duration ProductRelated
## Administrative 0.37 0.44
## Administrative_Duration 0.32 0.33
## Informational 0.94 0.38
## Informational_Duration 1.00 0.37
## ProductRelated 0.37 1.00
## ProductRelated_Duration 0.37 0.85
## BounceRates -0.14 -0.25
## ExitRates -0.21 -0.37
## PageValues 0.24 0.34
## SpecialDay -0.05 -0.03
## OperatingSystems 0.00 0.03
## Browser -0.03 0.00
## Region -0.01 -0.04
## TrafficType -0.03 -0.05
## Revenue 0.11 0.17
## ProductRelated_Duration BounceRates ExitRates
## Administrative 0.39 -0.25 -0.35
## Administrative_Duration 0.34 -0.20 -0.29
## Informational 0.37 -0.14 -0.20
## Informational_Duration 0.37 -0.14 -0.21
## ProductRelated 0.85 -0.25 -0.37
## ProductRelated_Duration 1.00 -0.24 -0.34
## BounceRates -0.24 1.00 0.78
## ExitRates -0.34 0.78 1.00
## PageValues 0.34 -0.19 -0.26
## SpecialDay -0.05 0.14 0.13
## OperatingSystems 0.03 0.04 0.02
## Browser 0.01 -0.03 -0.01
## Region -0.02 -0.01 -0.01
## TrafficType -0.05 0.09 0.07
## Revenue 0.18 -0.16 -0.21
## PageValues SpecialDay OperatingSystems Browser Region
## Administrative 0.35 -0.11 0.00 -0.03 0.00
## Administrative_Duration 0.30 -0.10 -0.01 -0.04 0.01
## Informational 0.23 -0.05 0.00 -0.03 -0.02
## Informational_Duration 0.24 -0.05 0.00 -0.03 -0.01
## ProductRelated 0.34 -0.03 0.03 0.00 -0.04
## ProductRelated_Duration 0.34 -0.05 0.03 0.01 -0.02
## BounceRates -0.19 0.14 0.04 -0.03 -0.01
## ExitRates -0.26 0.13 0.02 -0.01 -0.01
## PageValues 1.00 -0.07 -0.01 0.02 -0.01
## SpecialDay -0.07 1.00 0.02 0.01 -0.02
## OperatingSystems -0.01 0.02 1.00 0.14 0.01
## Browser 0.02 0.01 0.14 1.00 0.05
## Region -0.01 -0.02 0.01 0.05 1.00
## TrafficType -0.03 0.04 0.10 0.00 0.01
## Revenue 0.60 -0.09 -0.02 0.02 -0.01
## TrafficType Revenue
## Administrative -0.03 0.14
## Administrative_Duration -0.02 0.13
## Informational -0.03 0.11
## Informational_Duration -0.03 0.11
## ProductRelated -0.05 0.17
## ProductRelated_Duration -0.05 0.18
## BounceRates 0.09 -0.16
## ExitRates 0.07 -0.21
## PageValues -0.03 0.60
## SpecialDay 0.04 -0.09
## OperatingSystems 0.10 -0.02
## Browser 0.00 0.02
## Region 0.01 -0.01
## TrafficType 1.00 0.00
## Revenue 0.00 1.00
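The matrix above is easier to scan as a heatmap. A minimal sketch using the `corrplot` package (assuming it is installed; it is not used elsewhere in this analysis):

```r
library(corrplot)
# Visualise the correlation matrix; hclust ordering groups
# strongly correlated variables next to each other
corrplot(corr_matrix, method = "color", type = "upper",
         order = "hclust", tl.cex = 0.6)
```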
admin <- shoppers$Administrative
admin_d <- shoppers$Administrative_Duration
info <- shoppers$Informational
info_d <- shoppers$Informational_Duration
prod <- shoppers$ProductRelated
prod_d <- shoppers$ProductRelated_Duration
exit <- shoppers$ExitRates
bounce <- shoppers$BounceRates
page <- shoppers$PageValues
# Administrative
plot(admin,exit,main = "Relationship between the Administrative and Exit Rate", xlab="Administrative", ylab="Exit Rate")
plot(admin,bounce,main = "Relationship between the Administrative and Bounce Rate",xlab="Administrative",ylab = "Bounce Rate")
plot(admin,page, main = "Relationship between the Administrative and Page Values",xlab="Administrative",ylab = "Page Value")
plot(admin, admin_d, main = "Relationship between the Administrative and Administrative Duration",xlab="Administrative",ylab = "Administrative Duration")
# Informational
plot(info,exit,main = "Relationship between the Informational and Exit Rate",xlab = "Informational",ylab = "Exit Rates")
plot(info,bounce,main = "Relationship between the Informational and Bounce Rate",xlab = "Informational", ylab = "Bounce Rates")
plot(info,page, main = "Relationship between the Informational and Page Values",xlab = "Informational",ylab = "Page Values")
plot(info, info_d, main = "Relationship between the Informational and Informational Duration", xlab = "Informational",ylab = "Informational Duration")
# Product Related
plot(prod,exit,main = "Relationship between the Product Related and Exit Rate",xlab="Product Related",ylab = "Exit Rates")
plot(prod,bounce,main = "Relationship between the Product Related and Bounce Rate",xlab="Product Related",ylab = "Bounce Rates")
plot(prod,page, main = "Relationship between the Product Related and Page Values",xlab="Product Related",ylab = "Page Values")
plot(prod, prod_d, main = "Relationship between the Product Related and ProductRelated Duration",xlab="Product Related",ylab ="Product Related Duration")
#Duration
plot(admin_d,info_d, main = "Relationship between Admin Duration and Informational Duration", xlab = "Administrative Duration", ylab = "Informational Duration")
plot(prod_d, info_d,main = "Relationship between ProductRelated Duration and Informational Duration", xlab = "Product Related Duration",ylab = "Informational Duration")
plot(admin_d,prod_d,main = "Relationship between Admin Duration and ProductRelated Duration", xlab = "Administrative Duration", ylab = "Product Related Duration")
library(dplyr)
library(ggplot2)
# how much page values do we have on weekends
by_weekend <- shoppers %>%
group_by(Weekend) %>%
summarise(pagevalues = sum(PageValues))
pv <- ggplot(by_weekend,aes(x=Weekend,y=pagevalues, fill=factor(Weekend)))+geom_bar(stat = "identity")+scale_fill_discrete(name = "Weekend", labels = c("Not Weekend","Weekend"))+ labs(title="Total Page Values on Weekends vs Weekdays", x="Weekend",y="Total Page Values")
pv
# Pages with the Highest Exit rate
## Administrative
by_admin <- shoppers %>%
group_by(Administrative)%>%
summarise(Exit = sum(ExitRates))
ad <- ggplot(by_admin,aes(x=factor(Administrative),y=Exit, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Total Exit Rate per Administrative Page", x="Administrative Pages",y="Total Exit Rate per Page")
ad+theme(legend.position = "none")
## Informational
by_info <- shoppers %>%
group_by(Informational)%>%
summarise(Exit = sum(ExitRates))
inf <- ggplot(by_info,aes(x=factor(Informational),y=Exit, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Total Exit Rate per Informational Page", x="Informational Pages",y="Total Exit Rate per Page")
inf+theme(legend.position = "none")
## Product Related
by_prod <- shoppers %>%
group_by(ProductRelated)%>%
summarise(Exit = sum(ExitRates))
pro <- ggplot(by_prod[1:20,],aes(x=reorder(factor(ProductRelated),Exit),y=Exit, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Total Exit Rate per Product Related Page",subtitle = "(Plotted for the Top 20 Pages)", x="Product Related Pages",y="Total Exit Rate per Page")
pro+theme(legend.position = "none")
# Pages with the Highest Bounce rate
## Administrative
by_admin1 <- shoppers %>%
group_by(Administrative)%>%
summarise(Bounce = sum(BounceRates))
ad1 <- ggplot(by_admin1,aes(x=factor(Administrative),y=Bounce, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Total Bounce Rate per Administrative Page", x="Administrative Pages",y="Total Bounce Rate per Page")
ad1 +theme(legend.position = "none")
## Informational
by_info1 <- shoppers %>%
group_by(Informational)%>%
summarise(Bounce = sum(BounceRates))
inf1 <- ggplot(by_info1,aes(x=factor(Informational),y=Bounce, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Total Bounce Rate per Informational Page", x="Informational Pages",y="Total Bounce Rate per Page")
inf1+theme(legend.position = "none")
## Product Related
by_prod1 <- shoppers %>%
group_by(ProductRelated)%>%
summarise(Bounce = sum(BounceRates))
pro1 <- ggplot(by_prod1[1:20,],aes(x=reorder(factor(ProductRelated),Bounce),y=Bounce, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Total Bounce Rate per Product Related Page",subtitle = "(Plotted for the Top 20 Pages)", x="Product Related Pages",y="Total Bounce Rate per Page")
pro1+theme(legend.position = "none")
# Pages with the Highest Page Values
## Administrative
by_admin2 <- shoppers %>%
group_by(Administrative)%>%
summarise(Page = sum(PageValues))
ad2 <- ggplot(by_admin2,aes(x=factor(Administrative),y=Page, fill=factor(Administrative)))+geom_bar(stat = "identity")+ labs(title="Total Page Values per Administrative Page", x="Administrative Pages",y="Total Page Values per Page")
ad2 +theme(legend.position = "none")
## Informational
by_info2 <- shoppers %>%
group_by(Informational)%>%
summarise(Page = sum(PageValues))
inf2 <- ggplot(by_info2,aes(x=factor(Informational),y=Page, fill=factor(Informational)))+geom_bar(stat = "identity")+ labs(title="Total Page Values per Informational Page", x="Informational Pages",y="Total Page Values per Page")
inf2+theme(legend.position = "none")
## Product Related
by_prod2 <- shoppers %>%
group_by(ProductRelated)%>%
summarise(Page = sum(PageValues))
pro2 <- ggplot(by_prod2[1:20,],aes(x=reorder(factor(ProductRelated),Page),y=Page, fill=factor(ProductRelated)))+geom_bar(stat = "identity")+ labs(title="Total Page Values per Product Related Page",subtitle = "(Plotted for the Top 20 Pages)", x="Product Related Pages",y="Total Page Values per Page")
pro2+theme(legend.position = "none")
# Exit Rates over the Months
exit_months <- shoppers %>%
group_by(Month)%>%
summarise(exit = sum(ExitRates), bounce =sum(BounceRates), pages = sum(PageValues))
#Exit Rate
mon <- ggplot(exit_months,aes(x=reorder(factor(Month),exit),y=exit, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Exit Rate per Month", x="Months",y="Total Exit Rate")
mon+theme(legend.position = "none")
#Bounce Rates
mon1 <- ggplot(exit_months,aes(x=reorder(factor(Month),bounce),y=bounce, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Bounce Rate per Month", x="Months",y="Total Bounce Rate")
mon1 + theme(legend.position = "none")
# PageValues
mon2 <- ggplot(exit_months,aes(x=reorder(factor(Month),pages),y=pages, fill=factor(Month)))+geom_bar(stat = "identity")+ labs(title="Total Page Values per Month", x="Months",y="Total Page Values")
mon2 + theme(legend.position = "none")
From our analysis we observe a few interesting things. Special days fall in February and May. The site is active only 10 out of 12 months of the year, with January and April missing. The most popular months are: 1. May 2. March 3. November 4. December
The months with the most revenue are: 1. November 2. May 3. December
The site records far more sessions without revenue than with it, a case of imbalanced data that we will deal with when modelling.
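This class imbalance can be quantified directly; a quick sketch (assuming `shoppers$Revenue` is still a factor at this point):

```r
# Proportion of sessions with (TRUE) and without (FALSE) revenue
prop.table(table(shoppers$Revenue))
```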
Interestingly February has special days yet it is not a popular month and also doesn’t generate a lot of revenue
The visitors who come to the site are mostly Returning customers who are seen mostly in May,November, March and December.
In December and November we also observe a rare group of visitors, Others who are only seen during htis two months Also December,November and March experience highest numbers in new visitors Therefore the Months:- 1. December 2. November 3. March 4. May
Should be the focus of the Sales and Marketing team, they are the months for most likely to retain its customers and also gain new customers
There is more activity during Weekdays compared to Weekends, most revenue being acquired on Weekdays compared to Weekends, but still there is significant activity with revenue in both Weekdays and Weekends
There are 3 Traffic Types the most popular is Type 2, it most active in November,March and May In May three Traffic Types are most popular and that is Type 2,3 and 4. In March Type 1 is very active
Our data is not normally distributed. From our correlation table we see that Administrative and Administrative Duration, Informational and Informational Duration, and Product Related and Product Related Duration have strong positive correlations of 0.77, 0.94 and 0.85 respectively.
This means the more of these pages a user accesses, the longer they are likely to stay on the site.
Revenue has a strong positive correlation with Page Values, a correlation of 0.6: the higher the Page Values, the higher the revenue
Bounce Rates and Exit Rates also have a strong positive correlation of 0.78: a page that is visited is also likely to be bounced out of. We therefore see that many of the pages with high visitor numbers also had high bounce counts. These were:- 1. Page 0 - Administrative 2. Page 0 - Informational 3. Pages 1, 2, 3, 4 - Product Related
These are probably the first pages the user interacts with, which might explain the lack of revenue, and are therefore an area for the Sales and Marketing team to focus on. Although the Bounce Rates are lower than the Exit Rates, the difference is not that significant
The pages recording high Page Values are:- 1. Pages 0 and 9 - Administrative 2. Page 0 - Informational 3. Pages 19, 15, 13 - Product Related
The products being viewed are on pages that are not easy to locate
Even though the months March, May, November and December have higher Page Values and therefore more revenue, the months December, March and November also experience very high Bounce Rates.
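The correlation figures quoted above come from R's `cor()` function. As a minimal sketch on hypothetical toy vectors (not the real dataset, which is not reproduced here), this is how a pairwise Pearson correlation like the 0.85 between Product Related pages and their duration would be computed:

```r
# Toy illustration (hypothetical values, not the shoppers data):
# cor() computes the Pearson correlation by default.
pages    <- c(1, 2, 4, 7, 11)       # page counts for five sessions
duration <- c(10, 25, 42, 80, 118)  # time spent on those pages
r <- cor(pages, duration)
round(r, 2)                         # close to 1: more pages, more time
```

On the full data frame, `cor(shoppers[, sapply(shoppers, is.numeric)])` would give the whole correlation table at once.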
# Transform Factors to Numeric
shoppers$Month <- as.numeric(shoppers$Month)
shoppers$VisitorType <- as.numeric(shoppers$VisitorType)
shoppers$Weekend <- as.numeric(shoppers$Weekend)
str(shoppers)
## 'data.frame': 12205 obs. of 18 variables:
## $ Administrative : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : num 1 2 1 2 10 19 1 1 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.146 0 0.146 0.146 0.02 ...
## $ ExitRates : num 0.171 0.171 0.171 0.171 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.6 0 0.6 0.6 ...
## $ Month : num 3 3 3 3 3 3 3 3 3 3 ...
## $ OperatingSystems : num 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : num 1 2 1 2 5 2 5 2 2 5 ...
## $ Region : num 1 1 8 2 1 1 3 1 2 1 ...
## $ TrafficType : num 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : num 3 3 3 3 3 3 3 3 3 3 ...
## $ Weekend : num 1 1 1 1 2 1 1 2 1 1 ...
## $ Revenue : num 1 1 1 1 1 1 1 1 1 1 ...
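A caveat worth noting about the conversion above: `as.numeric()` on a factor returns the underlying level codes, which follow alphabetical level order rather than calendar order. This is consistent with the output showing `Month: num 3` for the February rows, since "Aug" and "Dec" sort before "Feb". A small self-contained sketch:

```r
# as.numeric() on a factor yields level codes in alphabetical order,
# not calendar order.
m <- factor(c("Feb", "Feb", "Mar", "May", "Nov"))
levels(m)        # "Feb" "Mar" "May" "Nov" (alphabetical)
as.numeric(m)    # 1 1 2 3 4 -- level codes, not month numbers
```

The codes still give each month a distinct numeric value, which is all the clustering needs, but they should not be read as calendar positions.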
# K-Means is affected by Imbalanced data
table(shoppers$Revenue)
##
## 1 2
## 10297 1908
prop.table(table(shoppers$Revenue))
##
## 1 2
## 0.8436706 0.1563294
# our class label is imbalanced thus we shall balance it
# Balance the classes using ROSE
library(ROSE)
## Loaded ROSE 0.0-3
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Make the Revenue Class a factor again
shoppers$Revenue <- as.factor(shoppers$Revenue)
# Use ROSE, which balances the classes by generating synthetic data
train.rose <- ROSE(Revenue ~ ., data = shoppers, seed = 1)$data
table(train.rose$Revenue)
##
## 1 2
## 6149 6056
feature <- train.rose[, -18]
label <- train.rose[,"Revenue"]
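The idea behind the balancing step can be illustrated without ROSE. ROSE additionally synthesises new points; the base-R sketch below (on a hypothetical toy label, names are illustrative) shows the simpler version of the idea, resampling the minority class with replacement until the classes are even:

```r
# Sketch of class balancing with base R only: random oversampling of
# the minority class (ROSE goes further and generates synthetic rows).
set.seed(1)
y <- factor(c(rep("No", 90), rep("Yes", 10)))  # imbalanced toy label
minority <- which(y == "Yes")
extra <- sample(minority, sum(y == "No") - length(minority), replace = TRUE)
y_bal <- y[c(seq_along(y), extra)]
table(y_bal)  # both classes now have 90 observations
```

In the report itself ROSE is preferred because plain oversampling only duplicates existing rows, which can encourage overfitting.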
# Normalize the data
normalize <- function(x){
  return ((x - min(x)) / (max(x) - min(x)))
}
# normalize the train
feat_norm <- as.data.frame(lapply(feature, normalize))
summary(feat_norm)
## Administrative Administrative_Duration Informational
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2802 1st Qu.:0.3304 1st Qu.:0.2940
## Median :0.3723 Median :0.4028 Median :0.3723
## Mean :0.4056 Mean :0.4348 Mean :0.4198
## 3rd Qu.:0.5083 3rd Qu.:0.5025 3rd Qu.:0.5338
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2982 1st Qu.:0.3269 1st Qu.:0.3086
## Median :0.3723 Median :0.3980 Median :0.3833
## Mean :0.4170 Mean :0.4258 Mean :0.4133
## 3rd Qu.:0.5030 3rd Qu.:0.4949 3rd Qu.:0.4862
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2771 1st Qu.:0.3056 1st Qu.:0.3040 1st Qu.:0.2505
## Median :0.3139 Median :0.3472 Median :0.4028 Median :0.2962
## Mean :0.3418 Mean :0.3771 Mean :0.4617 Mean :0.3213
## 3rd Qu.:0.3597 3rd Qu.:0.4066 3rd Qu.:0.6314 3rd Qu.:0.3465
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Month OperatingSystems Browser Region
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4692 1st Qu.:0.3374 1st Qu.:0.3129 1st Qu.:0.3964
## Median :0.5832 Median :0.4264 Median :0.3911 Median :0.4759
## Mean :0.5570 Mean :0.4334 Mean :0.4222 Mean :0.5012
## 3rd Qu.:0.6663 3rd Qu.:0.5211 3rd Qu.:0.4868 3rd Qu.:0.5891
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## TrafficType VisitorType Weekend
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2700 1st Qu.:0.5608 1st Qu.:0.2909
## Median :0.3404 Median :0.6522 Median :0.3656
## Mean :0.3767 Mean :0.6097 Mean :0.4108
## 3rd Qu.:0.4395 3rd Qu.:0.7164 3rd Qu.:0.5130
## Max. :1.0000 Max. :1.0000 Max. :1.0000
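As a quick sanity check before clustering, the min-max `normalize` helper defined above maps any numeric vector onto [0, 1], which is why every column in the summary has `Min. 0` and `Max. 1`. A self-contained sketch (the helper is redefined here so the block runs on its own):

```r
# Min-max normalization: (x - min) / (max - min) maps the minimum
# to 0 and the maximum to 1, keeping relative spacing in between.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
v <- c(3, 7, 10, 23)
n <- normalize(v)
range(n)  # 0 1
```

This matters for K-Means because the algorithm uses Euclidean distance, so features on large scales (e.g. durations in seconds) would otherwise dominate the clustering.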
# Apply K-Means with k = 2
model <- kmeans(feat_norm,2, nstart = 30)
# check for no. of records in each cluster
model$size
## [1] 3677 8528
# the final cluster centers
model$centers
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0.5274303 0.5339308 0.6159815 0.6020258
## 2 0.3530291 0.3920868 0.3352657 0.3372309
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 0.5201412 0.5135446 0.3138185 0.3376627 0.5351084
## 2 0.3851206 0.3701311 0.3538725 0.3941598 0.4300299
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 0.3083943 0.5754530 0.4325012 0.4150062 0.4970129 0.3713982
## 2 0.3268287 0.5490706 0.4337791 0.4253063 0.5029515 0.3789528
## VisitorType Weekend
## 1 0.6346745 0.4226689
## 2 0.5989623 0.4056178
# shows the cluster where each record falls
#model$cluster
# Verify the results by comparing cluster assignments with the true labels
plot(feat_norm[c(1:2)], col = model$cluster)
plot(feat_norm[c(1:2)], col = label)
plot(feat_norm[c(3:4)], col = model$cluster)
plot(feat_norm[c(3:4)], col = label)
plot(feat_norm[c(5:6)], col = model$cluster)
plot(feat_norm[c(5:6)], col = label)
tb <- table(model$cluster, label)
tb
## label
## 1 2
## 1 1265 2412
## 2 4884 3644
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
paste0("Accuracy: ",accuracy(tb))
## [1] "Accuracy: 40.221220811143"
# Scale our data first
data <- shoppers[,-18]
data <- scale(data)
head(data)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 -0.768773 -0.6425037 -0.5240717 -0.4952383
## 2 -0.768773 -0.6425037 -0.5240717 -0.4952383
## 3 -0.768773 -0.6425037 -0.5240717 -0.4952383
## 4 -0.768773 -0.6425037 -0.5240717 -0.4952383
## 5 -0.768773 -0.6425037 -0.5240717 -0.4952383
## 6 -0.768773 -0.6425037 -0.5240717 -0.4952383
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1 -0.9237940 -0.8835210 2.69534563 2.6456148 -0.5367523
## 2 -0.8909011 -0.8308182 -0.49740887 2.6456148 -0.5367523
## 3 -0.9237940 -0.8835210 2.69534563 2.6456148 -0.5367523
## 4 -0.8909011 -0.8813250 2.69534563 2.6456148 -0.5367523
## 5 -0.6277577 -0.3667864 -0.05975975 0.1422596 -0.5367523
## 6 -0.3317213 -0.7565264 -0.15189640 -0.3845030 -0.5367523
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1 -0.3376272 -1.334436 -1.4324443 -1.0259045 -0.9150294 -0.78367014
## 2 -0.3376272 -1.334436 -0.1128743 -0.2161545 -0.9150294 -0.54236901
## 3 -0.3376272 -1.334436 2.5262657 -1.0259045 2.1178514 -0.30106789
## 4 -0.3376272 -1.334436 1.2066957 -0.2161545 -0.4817607 -0.05976676
## 5 -0.3376272 -1.334436 1.2066957 2.2130955 -0.9150294 -0.05976676
## 6 -0.3376272 -1.334436 -0.1128743 -0.2161545 -0.9150294 -0.30106789
## VisitorType Weekend
## 1 0.4096534 -0.5530653
## 2 0.4096534 -0.5530653
## 3 0.4096534 -0.5530653
## 4 0.4096534 -0.5530653
## 5 0.4096534 1.8079567
## 6 0.4096534 -0.5530653
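The output above is in z-score units because `scale()` centres each column to mean 0 and rescales it to standard deviation 1 (a different choice from the min-max normalization used for K-Means). A minimal self-contained sketch on a toy matrix:

```r
# scale() standardizes each column: subtract the column mean,
# then divide by the column standard deviation.
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
z <- scale(x)
colMeans(z)      # approximately 0 for both columns
apply(z, 2, sd)  # 1 for both columns
```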
#calculate the Euclidean distance
d <- dist(data, method = "euclidean")
# apply the h-clustering and use different methods
model_comp <- hclust(d, method = "complete")
model_ward <- hclust(d, method = "ward.D")
model_ward2 <- hclust(d, method = "ward.D2")
model_sing <- hclust(d, method = "single")
model_avg <- hclust(d, method = "average")
model_mc <- hclust(d, method = "mcquitty")
model_med <- hclust(d, method = "median")
model_cent <- hclust(d , method = "centroid")
# Plot dendrograms
plot(model_comp, cex = 0.6, hang = -1)
plot(model_ward, cex = 0.6, hang = -1)
plot(model_ward2, cex = 0.6, hang = -1)
plot(model_sing, cex = 0.6, hang = -1)
plot(model_avg, cex = 0.6, hang = -1)
plot(model_mc, cex = 0.6, hang = -1)
plot(model_med, cex = 0.6, hang = -1)
plot(model_cent, cex = 0.6, hang =-1)
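The dendrograms above are only plotted; to actually extract cluster assignments (and so cross-tabulate them against a label, as was done for K-Means), `cutree()` can cut the tree at a chosen number of clusters. Since the shoppers distance matrix is not reproduced here, the sketch below demonstrates the same pipeline on R's built-in `iris` data:

```r
# cutree() turns a fitted hclust tree into hard cluster assignments,
# which can then be compared against a known label. Demonstrated on
# the built-in iris data (scale -> dist -> hclust -> cutree).
d  <- dist(scale(iris[, 1:4]), method = "euclidean")
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 3)          # cut the tree into 3 clusters
table(cl, iris$Species)          # agreement with the true species
```

Applied to `model_ward2` with `k = 2`, the same `cutree` + `table` step would give Hierarchical Clustering a measurable accuracy comparable to the K-Means figure above.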
Comparing the two clustering methods: K-Means achieved an accuracy of about 40%, while for Hierarchical Clustering the model using the ward.D2 linkage clusters best. It appears to form clusters better than K-Means, as we can see well-defined final branches on its dendrogram
The challenge with Hierarchical Clustering is that we cannot see which variables formed each cluster, nor can we easily measure its performance to compare it with K-Means.
In terms of helping the client understand their customers' behaviour, we can use Hierarchical Clustering, although we would need another model to measure performance and to classify a customer as likely to generate revenue or not
From the two algorithms we can see that neither was sufficient on its own: K-Means performed poorly, while Hierarchical Clustering was able to form clusters but gave us no direct way to measure performance
Therefore I believe we can explore other unsupervised machine learning models, such as neural networks (e.g. self-organizing maps)